Bayesian Distributional Policy Gradients

Authors

Abstract

Distributional Reinforcement Learning (RL) maintains the entire probability distribution of the reward-to-go, i.e. the return, providing more learning signals that account for the uncertainty associated with policy performance, which may be beneficial for trading off exploration and exploitation, and for policy learning in general. Previous works in distributional RL focused mainly on computing state-action-return distributions; here we model state-return distributions. This enables us to translate successful conventional RL algorithms that are based on state values into distributional RL. We formulate the Bellman operation as an inference-based auto-encoding process that minimises Wasserstein metrics between target and model return distributions. The proposed algorithm, BDPG (Bayesian Distributional Policy Gradients), uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns. Moreover, we can now interpret the return prediction uncertainty as an information gain, which allows us to obtain a new curiosity measure that helps BDPG steer exploration actively and efficiently. We demonstrate in a suite of Atari 2600 games and MuJoCo tasks, including well-known hard-exploration challenges, how BDPG learns generally faster and with higher asymptotic performance than reference distributional RL algorithms.
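To make the abstract's main ingredients concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of a variational auto-encoder over state-return pairs whose reconstruction loss is an empirical 1-Wasserstein distance between sampled model returns and Bellman-target returns, with a KL term reused as an information-gain curiosity bonus. Network sizes, the standard-normal prior, and the closed-form Gaussian KL (in place of the paper's adversarial joint-contrastive estimator) are all assumptions for illustration.

```python
# Minimal sketch of the ideas described in the abstract; all names and sizes are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ReturnVAE(nn.Module):
    def __init__(self, state_dim, latent_dim=8, hidden=64):
        super().__init__()
        # Encoder q(z | s, G): state and observed return -> Gaussian posterior parameters.
        self.encoder = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))
        # Decoder p(G | s, z): state and latent code -> predicted return sample.
        self.decoder = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

    def forward(self, state, target_return):
        mu, log_std = self.encoder(torch.cat([state, target_return], dim=-1)).chunk(2, dim=-1)
        std = log_std.exp()
        z = mu + std * torch.randn_like(std)            # reparameterised posterior sample
        pred_return = self.decoder(torch.cat([state, z], dim=-1))
        # KL(q(z|s,G) || N(0, I)); also usable as an information-gain curiosity signal.
        kl = 0.5 * (mu.pow(2) + std.pow(2) - 2 * log_std - 1).sum(-1)
        return pred_return, kl

def wasserstein_1(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D sample sets."""
    a, _ = torch.sort(samples_a.flatten())
    b, _ = torch.sort(samples_b.flatten())
    return (a - b).abs().mean()

# Usage on a dummy batch: placeholder Bellman targets G = r + gamma * G(s').
state_dim, batch = 4, 32
model = ReturnVAE(state_dim)
states = torch.randn(batch, state_dim)
bellman_targets = torch.randn(batch, 1)
pred, kl = model(states, bellman_targets)
loss = wasserstein_1(pred, bellman_targets) + 1e-3 * kl.mean()
loss.backward()
curiosity_bonus = kl.detach()   # larger KL -> more surprising return -> explore more
```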


Similar Articles

Distributed Distributional Deterministic Policy Gradients

This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple impr...


Bayesian Policy Gradients via Alpha Divergence Dropout Inference

Policy gradient methods have had great success in solving continuous control tasks, yet the stochastic nature of such problems makes deterministic value estimation difficult. We propose an approach which instead estimates a distribution by fitting the value function with a Bayesian Neural Network. We optimize an α-divergence objective with Bayesian dropout approximation to learn and estimate th...
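As a rough illustration of the dropout-based Bayesian value estimation this abstract describes, the sketch below uses Monte-Carlo dropout: dropout stays active at evaluation time and repeated stochastic forward passes yield a predictive mean and an uncertainty estimate. The layer sizes and dropout rate are assumptions, and the paper's α-divergence training objective is not reproduced here.

```python
# Hypothetical MC-dropout value estimator; not the paper's code.
import torch
import torch.nn as nn

class DropoutValueNet(nn.Module):
    def __init__(self, state_dim, hidden=64, p=0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(p),
                                 nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state)

def mc_value_estimate(model, state, n_samples=50):
    """Monte-Carlo dropout: keep dropout stochastic and average repeated forward passes."""
    model.train()                      # keeps Dropout layers active at inference time
    with torch.no_grad():
        samples = torch.stack([model(state) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # predictive mean and epistemic uncertainty

states = torch.randn(16, 4)
value_mean, value_std = mc_value_estimate(DropoutValueNet(4), states)
```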


Bayesian Optimization with Gradients

Bayesian optimization has been successful at global optimization of expensive-to-evaluate multimodal objective functions. However, unlike most optimization methods, Bayesian optimization typically does not use derivative information. In this paper we show how Bayesian optimization can exploit derivative information to find good solutions with fewer objective function evaluations. In particular, ...


Bayesian structured additive distributional regression

In this paper, we propose a generic Bayesian framework for inference in distributional regression models in which each parameter of a potentially complex response distribution and not only the mean is related to a structured additive predictor. The latter is composed additively of a variety of different functional effect types such as nonlinear effects, spatial effects, random coefficients, int...


Hindsight policy gradients

Goal-conditional policies allow reinforcement learning agents to pursue specific goals during different episodes. In addition to their potential to generalize desired behavior to unseen goals, such policies may also help in defining options for arbitrary subgoals, enabling higher-level planning. While trying to achieve a specific goal, an agent may also be able to exploit information about the ...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i10.17024